The theory of statistical hypothesis testing was developed in the early 20th century. Among other uses,
it was designed to apply the scientific method to data sampled from populations. In the following
sections, we explain the steps of hypothesis testing, the potential results, and possible errors that can
be made when interpreting a statistical test. We also define and describe the relationships between
power, sample size, and effect size in testing.
Getting the language down
Here are some of the most common terms used in hypothesis testing:
Null hypothesis (abbreviated H₀): The assertion that any apparent effect you see in your data is not evidence of a true effect in the population, but is merely the result of random fluctuations.
Alternate hypothesis (abbreviated H₁ or HAlt): The assertion that there indeed is evidence in your data of a true effect in the population, over and above what would be attributable to random fluctuations.
Significance test: A calculation designed to determine whether H₀ can reasonably explain what you see in your data or not.
Significance: The conclusion that random fluctuations alone can't account for the size of the effect you observe in your data. In this case, H₀ must be false, so you accept H₁.
Statistic: A number that you obtain or calculate from your sample.
Test statistic: A number calculated from your sample for the purpose of testing H₀. In general, the test statistic is the ratio of a number that measures the size of the effect (the signal) to a number that measures the size of the random fluctuations (the noise); the sketch after this list shows one such calculation.
p value: The probability that random fluctuations alone (in the absence of any true effect in the population) could produce an effect at least as large as the one you observe in your sample. Equivalently, it's the probability of random fluctuations making the test statistic at least as large as the one you calculate from your sample (more precisely, at least as far away from what H₀ predicts, in the direction of H₁).
Type I error: Choosing that H₁ is correct when in fact no true effect above random fluctuations is present.
Alpha (α): The probability of making a Type I error.
Type II error: Choosing that H₀ is correct when in fact there is indeed a true effect present that rises above random fluctuations.
Beta (β): The probability of making a Type II error.
Power: The same as 1 – β, which is the probability of choosing H₁ as correct when in fact a true effect above random fluctuations is present.
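To make these terms concrete, here is a minimal Python sketch using NumPy and SciPy. The specific numbers (a true effect of 0.5, a standard deviation of 1, a sample size of 30, and an alpha of 0.05) are illustrative assumptions, not values from the text. The sketch computes the signal-to-noise test statistic and p value for a one-sample Student t test, then estimates the Type I error rate and power by simulation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def one_sample_t(sample, mu0=0.0):
    """Return (t, p) for a two-sided one-sample Student t test of H0: mean == mu0."""
    n = len(sample)
    signal = sample.mean() - mu0              # size of the observed effect
    noise = sample.std(ddof=1) / np.sqrt(n)   # standard error: size of random fluctuations
    t = signal / noise                        # test statistic = signal / noise
    # p value: chance that random fluctuations alone push the statistic
    # at least this far from what H0 predicts, in either direction
    p = 2 * stats.t.sf(abs(t), df=n - 1)
    return t, p

# One sample of 30 drawn from a population whose true mean is 0.5 (so H0 is false)
t, p = one_sample_t(rng.normal(loc=0.5, scale=1.0, size=30))
print(f"t = {t:.2f}, p = {p:.4f}")   # "significant" if p < alpha

# Monte Carlo estimates of alpha (Type I error rate) and power (1 - beta)
alpha, trials = 0.05, 10_000
type_i  = sum(one_sample_t(rng.normal(0.0, 1.0, 30))[1] < alpha for _ in range(trials))
correct = sum(one_sample_t(rng.normal(0.5, 1.0, 30))[1] < alpha for _ in range(trials))
print(f"Type I rate = {type_i / trials:.3f}  (should be close to alpha = {alpha})")
print(f"Power       = {correct / trials:.3f} (probability of detecting the 0.5 effect)")
```

The first simulation loop draws samples with no true effect, so any rejection of H₀ is a Type I error; the second draws samples with a real 0.5 effect, so the rejection rate there estimates the test's power.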
Testing for significance
All the common statistical significance tests, including the Student t test, chi-square, and